ABSTRACT

Modern dynamic programming languages often treat everything as an
object, even primitive values such as integers and booleans. Despite
the benefits of this approach, boxing a primitive value inside an
object can be inefficient. The runtime cost of boxing and unboxing can
be reduced with techniques such as tagged pointers and NaN-boxing. This
paper is a progress report on the author's ongoing effort to add tagged
pointer support to a new Python implementation, Pyston. It covers the
theory and practice of tagged pointers and discusses the implementation
details of this optimisation. Some preliminary results are also
presented to demonstrate the performance improvements.

1. INTRODUCTION 

Dynamically typed programming languages (e.g. Python, Ruby and
JavaScript) often treat every construct as an object. This avoids the
complexity of having two distinct kinds of values, objects and
primitives, inside one language. Java is a typical example of a
language that suffers from having built-in primitive types in an
object-oriented language [1]. Having everything as an object is
conceptually simpler for programmers, and it often leads to more
flexible code: for example, Python has first-class functions, and this
relies on functions being special objects with a built-in method
called __call__.

To achieve this, implementations of such languages often use a
technique called boxing. For example, in CPython (the official
implementation of the Python language), every object is referred to
through a pointer of type PyObject *. Inside every object there is a
dedicated cls field that helps the interpreter track the type of the
object. For user-defined objects and classes, this approach is fine.
But for some primitive types,

*This paper is my report for the Modern Compiler Design course of the
MPhil in Advanced Computer Science at the University of Cambridge.
Figure 1: The last three bits of a pointer are always 0 on a 64-bit
platform with a word-aligning memory allocator, so they can be used for
type tagging. If the last bit is 0, the word is a normal pointer to an
object. If the last bit is 1, it is a tagged pointer with a primitive
value packed into it. If necessary, the top 16 bits of a 64-bit pointer
can also be used for additional type information.

especially integers and booleans, this approach is inefficient in two
ways. Consider the implementation of an integer object in Listing 1.
First, treating integers as objects clearly uses more memory. Besides
the necessary field n, we also need to keep track of the type of the
object, and to use the object we need a pointer to it. In total, the
memory cost is higher than that of a plain integer. Second, creating a
new integer object requires the new operator, i.e. memory for the
integer has to be allocated on the heap. Worse, the language being
implemented usually includes a garbage collector, so every object
creation has to go through a customised memory allocator, which can be
quite complex. Garbage collection also adds extra run-time cost to the
interpreter (see Section 2.1 for details).
Listing 1: Simple Integer Object Implementation



To avoid this overhead, several techniques have been developed for
representing primitive values efficiently in dynamically typed
languages. One notable technique is the tagged pointer.



The memory allocators in common use today return word-aligned
addresses for better performance. On a typical 64-bit platform, this
means that every pointer to an object is a multiple of 8, so the last 3
bits of the pointer are always 0. Moreover, on an x86-64 machine, a
pointer is currently at most 48 bits long (this gives an address space
of roughly 281 TB, which is effectively unlimited in today's context)
[2]. There is therefore some redundancy that can be exploited to
improve the performance of primitive values in dynamic language
implementations. Fig. 1 illustrates a common way of doing so. Assuming
an x86-64 platform, we know that the last three bits of a pointer are
0, so we can use them as a type tag. If the least significant bit (the
"tagging bit") is 1, the word is a special "tagged pointer". The two
bits next to the least significant bit can be used as a "type field",
e.g. 00 indicates that an integer is packed inside, 01 indicates a
boolean value, and so on. The remaining 61 bits hold the packed value.
For example, 00...001001 is a tagged pointer (the last bit is 1) of
type integer (00 next to the last bit) with the value 1 (the remaining
bits).

This technique is not limited to integers: other primitives such as
boolean values and 32-bit floating point numbers can also be packed
efficiently into pointers if the platform permits. However, there are
some drawbacks too. First, the technique is platform dependent; if the
language implementation targets several hardware platforms, the
compiler or interpreter designer has to take extra care when applying
this optimisation. Second, the range of an integer is smaller when
using tagged pointers, since some bits are taken by the tag; this might
cause performance issues in some special cases.

It is also worth noting that tagged pointers do not come for free.
Although they remove the need for boxing and unboxing, they add the
overhead of checking at runtime whether a word is a reference to an
object or a tagged pointer holding a primitive value. In addition,
integer arithmetic requires shifting before the computation (although
the shifting can be avoided with more advanced optimisations).
Measuring the performance improvement brought by tagged pointers in a
dynamic language implementation is therefore an interesting topic.

Pyston is a new open-source implementation of the Python programming
language initiated by Dropbox™, and it aims for high speed. It
currently has no built-in tagged pointer support and boxes everything
inside an object. Pyston has built-in inline caching and JIT
compilation support, with four compilation tiers. Python code is first
parsed into an Abstract Syntax Tree (AST) and initially interpreted by
a simple AST interpreter. If a piece of code becomes hot (a function is
called many times, or a loop is iterated enough times), the AST is
compiled down to LLVM IR, which is then optimised by LLVM's built-in
optimiser at an effort level from 0 to 3. The codebase is relatively
small (50,000 lines of code) and is written mostly in C and C++ (with
LLVM for code generation and JIT optimisations). The author picked this
implementation as the test vehicle for investigating the performance
improvements of applying the tagged pointer optimisation to dynamic
programming language implementations.

The main contributions of this paper are as follows. First, it
documents the technique in a formal form. There are some documents
online about using tagged pointers, but none of them is written in a
form that is ready to be peer-reviewed; this paper serves as a summary
of the technique. Second, it quantifies the performance improvement.
For three simple integer-arithmetic-intensive Python programs,
experiments show an average speed-up of about 50% from using tagged
pointers. In the rest of the paper, we first look at the implementation
details of tagged pointers in Pyston and then present the evaluation of
the performance improvement. The paper concludes with a look at future
work and an acknowledgement of other known uses of tagged pointers,
along with other techniques for improving the performance of primitive
values in "all things are objects" languages.

2. DESIGN AND IMPLEMENTATION 

Before diving into the details of implementing tagged pointers in
Pyston, the author first presents a simple benchmark of Pyston's
customised memory allocator. Knowing the time taken to allocate a boxed
integer helps to estimate the performance improvement.

2.1 Benchmarking the Memory Allocation 
Figure 2: Benchmarked time cost of memory allocation in Pyston: on
average, allocating a single boxed integer takes 65 cycles.

Fig. 2 shows the result of the benchmark. The benchmark is very simple:
it allocates N million boxed integers using Pyston's customised
allocator. The Linux time command is used to measure the total runtime
of the memory allocation. The odd thing is that the orange curve, which
represents the clock cycles per allocation, starts off with very high
values and then converges to around 65 cycles per allocation. This can
be explained by the fact that the time command counts not only the time
of the allocation but also Pyston's setup time for GC and threading.
Thus, when a relatively small number of integers is allocated, the
setup overhead heavily affects the precision of the result. But, as the
figure shows, the curve stays around 65 cycles when larger numbers of
integers are allocated, so we can say with confidence that allocating
one boxed integer normally takes about 65 cycles.



The point of running this benchmark is that, compared with boxing an
integer inside an object on the heap, packing an integer inside a
64-bit pointer should be much cheaper. Normally, packing an integer
into a pointer should take only 2 cycles (a bit shift and then an or
operation), and unpacking should take 1 cycle (a single shift). This is
discussed in more detail in the following sections.



    jne ic_fail         ; call the IC fail function
                        ; type checking on arguments
                        ; type checking passed
    jmp intAddInt       ; call intAddInt function

Listing 3: Inline Cache without Tagged Pointers



2.2 Design 

The core of the tagged pointer support in Pyston is shown in Listing 2.
The author aims for the lightest possible implementation, so no new
types (classes) are introduced. All tagged pointers have the type
Box *, and a set of C macros is provided for using tagged pointers in
the runtime.

First, for now we aim only at packing integers inside a pointer, so
_TP_SHIFT is defined as 1 for simplicity (i.e. there is no type field
yet). The remaining macros are helpers used in the runtime for
determining the type of a pointer and for packing and unpacking
integers. According to a benchmark developed by Torbjörn Granlund [5],
packing should take at most 2 clock cycles on an x86-64 platform and
unpacking only 1 clock cycle.

#define _TP_SHIFT 1

#define TP_ISINT(var)    (bool)(var & 1)
#define TP_PACKINT(var)  ((var << _TP_SHIFT) | 0x1)
#define TP_GETINT(var)   (var >> _TP_SHIFT)
#define TP_GETCLASS(obj) (TP_ISINT(obj) ? int_cls : obj->cls)

Listing 2: Tagged Pointer (Some Type Conversions Are Omitted)

In Python, all field accesses and method calls on an object are
resolved at the class level. For example, obj1.method(obj2) in Python
is just syntactic sugar for class.method(obj1, obj2). The interpreter
also relies on the cls field to track the type of an object, so the
language runtime accesses the cls field of objects in numerous places.
This would be fine if the object being dealt with were an actual
pointer to a memory location; if it is a tagged pointer, however,
accessing the cls field directly would cause a segmentation fault. The
macro TP_GETCLASS is therefore needed to replace every obj->cls in the
implementation.

2.3 Implementation 

Migrating the current Pyston codebase to tagged pointers is
conceptually easy: we just need to modify the built-in integer module
and the associated parts of the interpreter. Every occurrence of Box *
in the codebase that is related to the integer class has to be changed
to use the macros listed in the previous section. This is easy at the
AST interpreter level. For the machine code generated by the Inline
Cache (IC), however, it is trickier.

    load r0, pointer_to_obj
    add r0, offset_of_cls   ; load r0 with obj->cls
    cmp r0, int_cls         ; test obj->cls == int_cls
                            ; type checking failed



Python, like some other programming languages, uses runtime method
binding: the attribute or method being accessed on an object is
determined only at runtime. When a method of an object is invoked, the
runtime system looks the method up using the type of the object, the
name of the attribute or method, and the types of the arguments. This
lookup is an overhead, since it is often time consuming and complex
(especially so with Python's descriptor logic). Empirically, however,
the method invoked at a given call site tends to be the same method in
most cases, so the overhead can be avoided by caching the result of the
method lookup "inline" at every call site [3]. During the method lookup
procedure, the runtime system generates machine code that performs type
checking and caches the result of the lookup. The next time the
interpreter reaches this call site, the generated code first checks
whether the types of the arguments match the cached method. If the type
check fails, the runtime system falls back on the normal method lookup
routine; otherwise, it jumps to the cached address directly. Listing 3
shows an example of an inline cache. The Python code n - 1 is just
shorthand for the method invocation n.__sub__(1), which is implemented
internally by a C function, intSubInt. When interpreting this code, the
runtime system first invokes the method lookup routine and generates
the code in Listing 3. When the interpreter next encounters n - 1, the
generated code is executed. It first checks that the object's type is
integer and then that the argument's type is integer too. If all type
checks pass, the runtime system jumps to the intSubInt function
directly.

It is easy to see in Listing 3 that the way the IC-generated machine
code accesses the cls of an object is equivalent to the C code
obj->cls. Again, this is problematic with tagged pointers. Pyston
provides a simple assembler for generating machine code for the IC. To
accommodate tagged pointers, the author used this assembler to generate
code that emulates the functionality of the macro TP_GETCLASS at
runtime.

The internal integer module also has to be changed. Intuitively,
functions like Box* intAddInt(BoxInt* lhs, BoxInt* rhs) should be as
simple as one line:

return TP_PACKINT(TP_GETINT(lhs) + TP_GETINT(rhs));

There are two problems with this. First, it does not check for integer
overflow. Second, it is not the most efficient implementation: it
requires at least 5 instructions (two shifts, one add, then a shift
back and an or; see Listing 4). Moreover, although checking for
overflow when adding two 64-bit integers is efficient (it can often be
done by checking the CPU's carry or overflow flag, or simply by looking
at the sign bit of the result), it is harder with tagged pointers,
since there is no hardware support and the process has to be emulated
manually. David Chisnall has documented a more efficient way to
implement addition for tagged pointers [4]. Instead of shifting the
pointers to get the integers, we first clear the last bit of b. Then,
when a and b are added, the last bit of b is 0 and the last bit of a is
1, so the last bit of the result is 1 and the rest of the result holds
the sum of the two integers packed inside the two pointers. Better
still, this trick removes the need for a software overflow check, since
we can simply check the CPU's overflow flag after the addition.

// The intuitive implementation
a >>= 1;
b >>= 1;
c = a + b;
/* Check for overflow here */
return (c << 1) | 1;

// The more advanced implementation
b = b & ~1;
return a + b;

Listing 4: Tagged Pointer Integer Addition



Note that this trick can also be applied to subtraction with minor
modification. A similar trick can be applied to integer multiplication;
the author has decided not to repeat here what Chisnall has already
documented, and the reader can consult his webpage if needed. Also,
this trick is not applicable if a "type tag" is used in the
implementation.

3. EVALUATION 

As discussed before, there are two main benefits of using tagged
pointers for primitive values in dynamic programming language
implementations. First, the optimisation can speed up operations on
primitive values (integer arithmetic in this particular case). Second,
it should reduce the memory usage of primitive values in the
interpreter. To verify these assumptions, two experiments were carried
out: the first on execution time and the second on memory usage.

3.1 Execution Time 

Fig. 3a shows the result of running three integer-arithmetic-intensive
Python programs with Pyston, with and without tagged pointer support.
Matrixmul is a simple program that multiplies two 100 by 100 matrices
25 times, fib calculates the Fibonacci sequence, and sum sums up a
large integer array. On average, the programs run 51% faster with
tagged pointers than with the original version.

Note that these benchmarks are designed to fully exploit the possible
speed-up. Real-world programs may not have such a high density of
integer operations and thus may not benefit as much from tagged
pointers. However, the aim of the benchmarks is to measure the
efficiency gained from the optimisation, so focusing on integer
arithmetic is a reasonable choice.

3.2 Memory Usage 

Memory usage is evaluated using a benchmark that allocates 
a large array of integers. Results are presented in Fig. 3b. 



Measuring the memory usage of Pyston when interpreting Python code is
not as simple as using the time command. The author first tried
Valgrind and then realised that it only measures memory allocated on
the heap. The author then turned to /usr/bin/time -v to look at the
"maximum resident set size" of the program. This term means the peak
amount of physical memory held by the interpreter process, not counting
any swap space on disk. Since the memory usage is relatively small,
this figure should be precise enough for measuring the difference
before and after applying tagged pointers to Pyston.

From the figure, we can see that after adding tagged pointer support,
the memory usage of declaring a big integer array is significantly
smaller. Again, the author has to emphasise that this benchmark is
designed to exploit the benefit of the optimisation. For real-world
applications, the decrease in memory usage should be smaller.

4. CONCLUSIONS AND FUTURE WORK 

In general, we can see that tagged pointers can bring a significant
performance improvement, in both execution time and memory usage, to
the current Pyston implementation. Although the optimisation is
effective, there remain some aspects that the author has not looked
into because of time constraints. These are:

• Subclassing and packing other primitive values: the current
  implementation of tagged pointers supports only one primitive type,
  although other primitive values, such as booleans, can also be packed
  into tagged pointers. Also, Python's semantics¹ permit inheritance
  from primitive types, but the current implementation is limited in
  this respect. One way to solve this issue is to use the type field
  mentioned earlier, but this solution is not perfect, in the sense
  that it limits the number of subclasses of the integer class. A more
  careful design is needed.

• More serious testing and benchmarking: Pyston provides a
  comprehensive test suite and a number of benchmarks. There are still
  some minor bugs in the current implementation of tagged pointers (for
  example, it lacks full support for integers in range statements, and
  support for arithmetic mixing small integers and long integers is
  limited), and some of the tests are still failing. The author plans
  to eliminate these bugs gradually and to refactor the tagged pointer
  module into a C++ class with a more rigorously designed API. Also,
  Pyston's performance benchmarks are closer to real-world workloads
  than the benchmarks used in this paper, and testing the performance
  boost with them is planned.

• C API support: Pyston, like CPython, tries to support extending
  Python programs with C, and it aims to support the same C API as
  CPython. The problem is that the C API provided by CPython treats
  everything as an object. So there should be an abstraction layer
¹ Python does not have a formal specification, so when talking about
Python semantics the author is actually referring to the semantics of
the CPython implementation.
Figure 3: Evaluation Results



  between the C API and the underlying Pyston implementation if tagged
  pointers are used.

Overall, the current implementation of tagged pointers is not ready to
be merged into Pyston, but the author plans to continue working on it.
Tagged pointers have been implemented by Apple in the Objective-C
runtime on iOS. JavaScriptCore used to use them, and Google's V8 engine
still does. However, there are also other techniques for implementing
efficient primitive values in language runtimes, notably the technique
called "NaN-boxing". There is plenty of material online describing that
technique, so the author does not repeat it here.